Nonlinear Models (ISL 7)

Biostat 212A

Author

Dr. Jin Zhou @ UCLA

Published

February 25, 2025

Credit: This note heavily uses material from the books An Introduction to Statistical Learning: with Applications in R (ISL2) and Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL2).

Display system information for reproducibility.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.1    fastmap_1.2.0     cli_3.6.3        
 [5] tools_4.4.1       htmltools_0.5.8.1 rstudioapi_0.16.0 yaml_2.3.10      
 [9] rmarkdown_2.28    knitr_1.48        jsonlite_1.8.9    xfun_0.47        
[13] digest_0.6.37     rlang_1.1.4       evaluate_1.0.0   
import IPython
print(IPython.sys_info())
{'commit_hash': '22d6a1c16',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Users/jinjinzhou/.virtualenvs/r-reticulate/lib/python3.13/site-packages/IPython',
 'ipython_version': '8.31.0',
 'os_name': 'posix',
 'platform': 'macOS-15.3.1-arm64-arm-64bit-Mach-O',
 'sys_executable': '/Users/jinjinzhou/.virtualenvs/r-reticulate/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.13.0 (main, Oct  7 2024, 05:02:14) [Clang 16.0.0 '
                '(clang-1600.0.26.4)]'}

1 Overview

  • The truth is never linear! Or almost never!

    But often the linearity assumption is good enough.

  • When it’s not …

    • polynomials
    • step functions
    • splines
    • local regression, and
    • generalized additive models

    offer a lot of flexibility, without losing the ease and interpretability of linear models.

  • Main idea

    • To augment/replace the vector of inputs \(X\) with additional variables, which are transformations of \(X\), and then use linear models in this new space of derived input features.
    • For example, in regression problems, \(f(X) = \mbox{E}(Y\mid X)\) is usually modeled as a linear function of \(X\); now we instead model it as a linear function of transformations \(h_m(X)\) of \(X\):

\[ f(X) =\sum_{m=1}^M \beta_m h_m(X) \]

  • The beauty of this approach is that once the basis functions \(h_m\) have been determined, the models are linear in these new variables, and the fitting proceeds as for linear models.
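
To make the basis-function idea concrete, here is a minimal R sketch on simulated data; the particular basis choices \(h_1(x) = x\) and \(h_2(x) = \log x\) are illustrative assumptions, not from the text. Once the \(h_m\) are fixed, the fit is ordinary least squares in the derived features.

set.seed(212)
x <- runif(200, min = 1, max = 10)
y <- 2 + 3 * log(x) + rnorm(200, sd = 0.5)  # truth is nonlinear in x
# Linear model in the derived features h1(x) = x, h2(x) = log(x)
fit <- lm(y ~ x + log(x))
coef(fit)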

2 Polynomial regression

For most of this lecture we consider a single predictor, \(p=1\), as in the polynomial regression model \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i. \]

# Plot wage ~ age, display order-4 polynomial fit
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point() + 
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 4)
    ) +
  labs(
    title = "Degree-4 Polynomial",
    x = "Age",
    y = "Wage (k$)"
    )

# Visualize wage ~ age, display order-4 polynomial fit
plt.figure()
sns.lmplot(
  data = Wage, 
  x = "age", 
  y = "wage", 
  # Order-4 polynomial regression
  order = 4,
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  xlabel = 'Age', 
  ylabel = 'Wage (k$)',
  title = 'Degree-4 Polynomial'
  );
plt.show()

  • Create new variables \(X_1 = X\), \(X_2 = X^2\), …, and then treat as multiple linear regression.

  • Not really interested in the coefficients; more interested in the fitted function values at any value \(x_0\): \[ \hat f(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \hat{\beta}_3 x_0^3 + \hat{\beta}_4 x_0^4. \]

# poly(age, 4) constructs orthogonal polynomials of degree 1 to 4, all orthogonal to the constant
lmod <- lm(wage ~ poly(age, degree = 4), data = Wage)
summary(lmod)

Call:
lm(formula = wage ~ poly(age, degree = 4), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.707 -24.626  -4.993  15.217 203.693 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             111.7036     0.7287 153.283  < 2e-16 ***
poly(age, degree = 4)1  447.0679    39.9148  11.201  < 2e-16 ***
poly(age, degree = 4)2 -478.3158    39.9148 -11.983  < 2e-16 ***
poly(age, degree = 4)3  125.5217    39.9148   3.145  0.00168 ** 
poly(age, degree = 4)4  -77.9112    39.9148  -1.952  0.05104 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.91 on 2995 degrees of freedom
Multiple R-squared:  0.08626,   Adjusted R-squared:  0.08504 
F-statistic: 70.69 on 4 and 2995 DF,  p-value: < 2.2e-16
# poly(age, 4, raw = TRUE) produces the raw polynomials age, age^2, age^3, age^4, which match Python
lmod <- lm(wage ~ poly(age, degree = 4, raw = TRUE), data = Wage)
summary(lmod)

Call:
lm(formula = wage ~ poly(age, degree = 4, raw = TRUE), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.707 -24.626  -4.993  15.217 203.693 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        -1.842e+02  6.004e+01  -3.067 0.002180 ** 
poly(age, degree = 4, raw = TRUE)1  2.125e+01  5.887e+00   3.609 0.000312 ***
poly(age, degree = 4, raw = TRUE)2 -5.639e-01  2.061e-01  -2.736 0.006261 ** 
poly(age, degree = 4, raw = TRUE)3  6.811e-03  3.066e-03   2.221 0.026398 *  
poly(age, degree = 4, raw = TRUE)4 -3.204e-05  1.641e-05  -1.952 0.051039 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.91 on 2995 degrees of freedom
Multiple R-squared:  0.08626,   Adjusted R-squared:  0.08504 
F-statistic: 70.69 on 4 and 2995 DF,  p-value: < 2.2e-16
  • Since \(\hat f(x_0)\) is a linear function of the \(\hat{\beta}_j\), we can get a simple expression for the pointwise variance \(\operatorname{Var}[\hat f(x_0)]\) at any value \(x_0\) (see the sketch after this list).

  • We either fix the degree \(d\) at some reasonably low value, or use cross-validation to choose \(d\).

  • Can do separately on several variables. Just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).

  • Polynomial modeling can be done for generalized linear models (logistic regression, Poisson regression, etc.) as well.

  • Caveat: polynomials have notorious tail behavior. Very bad for extrapolation.
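
As a sketch of both points (pointwise variance and tail behavior), we can use predict() with se.fit = TRUE on the degree-4 fit lmod from above; the 95% bands fan out toward the boundary of age.

# Pointwise 95% bands for the degree-4 polynomial fit lmod
age_grid <- seq(from = min(Wage$age), to = max(Wage$age))
pred <- predict(lmod, newdata = data.frame(age = age_grid), se.fit = TRUE)
matplot(
  age_grid,
  cbind(pred$fit, pred$fit - 2 * pred$se.fit, pred$fit + 2 * pred$se.fit),
  type = "l", lty = c(1, 2, 2), col = c("blue", "gray", "gray"),
  xlab = "Age", ylab = "Wage (k$)"
)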

Code
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Polynomial regression with degree 14
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 14),
    color = "blue"
    ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, df = 14),
    color = "red"
    ) +  
  labs(
    title = "Natural cubic spline (red) vs polynomial regression (blue)",
    subtitle = "Both have df=15",
    x = "Age",
    y = "Wage (k$)"
    )

from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features of age predictor
poly_tf = make_column_transformer(
  (PolynomialFeatures(degree = 4, include_bias = False), ['age']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("poly_tf", poly_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)
Pipeline(steps=[('poly_tf',
                 ColumnTransformer(transformers=[('polynomialfeatures',
                                                  PolynomialFeatures(degree=4,
                                                                     include_bias=False),
                                                  ['age'])])),
                ('model', LinearRegression())])
# R^2
pipe.score(X, y)
0.0862646729203429
# Plot
plt.figure()
ax = sns.scatterplot(
  data = Wage,
  x = 'age',
  y = 'wage',
  alpha = 0.1
);
sns.lineplot(
  x = Wage['age'],
  y = pipe.predict(X),
  ax = ax
).set(
  title = "Polynomial regression (order = 4)",
  xlabel = 'Age', 
  ylabel = 'Wage (k$)'
);
plt.show()

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit linear regression
lmod = smf.ols(formula = 'wage ~ np.vander(age, 5, increasing = True) - 1', data = Wage).fit()
lmod.summary()
                                 OLS Regression Results
==============================================================================
Dep. Variable:                wage     R-squared:                       0.086
Model:                         OLS     Adj. R-squared:                  0.085
Method:              Least Squares     F-statistic:                     70.69
Date:             Tue, 25 Feb 2025     Prob (F-statistic):           2.77e-57
Time:                     17:47:50     Log-Likelihood:                -15315.
No. Observations:             3000     AIC:                         3.064e+04
Df Residuals:                 2995     BIC:                         3.067e+04
Df Model:                        4
Covariance Type:         nonrobust
==============================================================================
                                            coef    std err        t    P>|t|     [0.025     0.975]
np.vander(age, 5, increasing=True)[0]  -184.1542     60.040   -3.067    0.002   -301.879    -66.430
np.vander(age, 5, increasing=True)[1]    21.2455      5.887    3.609    0.000      9.703     32.788
np.vander(age, 5, increasing=True)[2]    -0.5639      0.206   -2.736    0.006     -0.968     -0.160
np.vander(age, 5, increasing=True)[3]     0.0068      0.003    2.221    0.026      0.001      0.013
np.vander(age, 5, increasing=True)[4] -3.204e-05   1.64e-05   -1.952    0.051  -6.42e-05   1.45e-07
==============================================================================
Omnibus:                  1097.594     Durbin-Watson:                   1.960
Prob(Omnibus):               0.000     Jarque-Bera (JB):             4965.521
Skew:                        1.722     Prob(JB):                         0.00
Kurtosis:                    8.279     Cond. No.                     5.67e+08
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.67e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
np.polyfit(Wage.age, Wage.wage, deg = 4)
array([-3.20383037e-05,  6.81068771e-03, -5.63859313e-01,  2.12455205e+01,
       -1.84154180e+02])

3 Piecewise polynomials (regression splines)

  • Instead of a single polynomial in \(X\) over its whole domain, we can rather use different polynomials in regions defined by knots. E.g., a piecewise cubic polynomial with a single knot at \(c\) takes the form \[ y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c \end{cases}. \]

  • Better to add constraints to the polynomials, e.g., continuity.

  • Splines have the “maximum” amount of continuity.

3.1 Linear spline

  • A linear spline with knots at \(\xi_k\), \(k = 1,\ldots,K\), is a piecewise linear polynomial continuous at each knot.

  • We can represent this model as \[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i, \] where \(b_k\) are basis functions:
    \[\begin{eqnarray*} b_1(x_i) &=& x_i \\ b_{k+1}(x_i) &=& (x_i - \xi_k)_+, \quad k=1,\ldots,K. \end{eqnarray*}\] Here \((\cdot)_+\) denotes the positive part: \[ (x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}. \]
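
A minimal sketch of this basis built by hand for the Wage data; the knot locations \(\xi_1 = 40\) and \(\xi_2 = 60\) are illustrative assumptions.

# Positive-part function (.)_+ and a hand-built linear-spline basis
pos <- function(x) pmax(x, 0)
# Knots at 40 and 60 (illustrative); the model is linear in the derived features
fit_ls <- lm(wage ~ age + pos(age - 40) + pos(age - 60), data = Wage)
coef(fit_ls)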

3.2 Cubic splines

  • A cubic spline with knots at \(\xi_k\), \(k = 1,\ldots,K\), is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

  • Again we can represent this model with truncated power basis functions \[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i, \] with \[\begin{eqnarray*} b_1(x_i) &=& x_i \\ b_2(x_i) &=& x_i^2 \\ b_3(x_i) &=& x_i^3 \\ b_{k+3}(x_i) &=& (x_i - \xi_k)_+^3, \quad k = 1,\ldots,K, \end{eqnarray*}\] where \[ (x_i - \xi_k)_+^3 = \begin{cases} (x_i - \xi_k)^3 & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}. \]

  • A cubic spline with \(K\) knots costs \(K+4\) parameters or degrees of freedom: \(4(K+1)\) polynomial coefficients minus \(3K\) constraints. For example, \(K=3\) knots gives \(4 \times 4 - 3 \times 3 = 7 = K+4\).

  • While the truncated power basis is conceptually simple, it is not too attractive numerically: powers of large numbers can lead to severe rounding problems. In practice, B-spline basis functions are preferred for their computational efficiency. See ESL Chapter 5 Appendix.
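
A quick numerical check of this point (a sketch; knots at 25, 40, 60 as in the spline fits below): the truncated power design matrix is badly conditioned, while the B-spline basis from bs() is not.

library(splines)
# Condition numbers: truncated power basis vs. B-spline basis
tp <- with(Wage, cbind(age, age^2, age^3,
                       pmax(age - 25, 0)^3,
                       pmax(age - 40, 0)^3,
                       pmax(age - 60, 0)^3))
kappa(tp)                                    # huge: powers of large numbers
kappa(bs(Wage$age, knots = c(25, 40, 60)))   # modest: numerically stable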

Code
from sklearn.preprocessing import SplineTransformer

# Cubic spline for age
X_age = np.array(X['age']).reshape(3000, 1)
x_plot = np.linspace(start = 15, stop = 85, num = 70)
X_plot = x_plot[:, np.newaxis]
bs_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'continue',
    # include_bias = False
    ).fit(X_age).transform(X_plot)
    
ns_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'linear',
    # include_bias = False
    ).fit(X_age).transform(X_plot)    
    
# Plot
fig, axes = plt.subplots(ncols = 2, figsize = (20, 6))
axes[0].plot(x_plot, bs_plot)
# axes[0].legend(axes[0].lines, [f"spline {n}" for n in range(4)])
axes[0].set_title("B-splines")

axes[1].plot(x_plot, ns_plot)
# axes[1].legend(axes[0].lines, [f"spline {n}" for n in range(8)])
axes[1].set_title("B-splines with linearity at boundary")
plt.show()

3.3 Natural cubic splines

  • Splines can have high variance at the outer range of the predictors.

  • A natural cubic spline extrapolates linearly beyond the boundary knots. This adds \(4 = 2 \times 2\) extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline.

  • A natural spline with \(K\) knots has \(K\) degrees of freedom.
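
A minimal check of this df bookkeeping with ns() on the Wage data (df = 4 is an illustrative choice):

library(splines)
# ns() with df = 4 yields 4 basis columns; interior knots at quantiles of age
nsb <- ns(Wage$age, df = 4)
dim(nsb)             # 3000 x 4
attr(nsb, "knots")   # interior knot locations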

Code
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ bs(x, knots = c(25, 40, 60)),
    color = "blue"
    ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, knots = c(25, 40, 60)),
    color = "red"
    ) +  
  labs(
    title = "Natural cubic spline fit (red) vs cubic spline fit (blue)",
    x = "Age",
    y = "Wage (k$)"
    )

3.4 Knot placement

  • One strategy is to decide \(K\), the number of knots, and then place them at appropriate quantiles of the observed \(X\).

  • In practice users often specify the degrees of freedom and let software choose the number of knots and their locations.

4 Smoothing splines

  • Consider this criterion for fitting a smooth function \(g(x)\) to some data: \[ \text{minimize} \quad \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt. \]

    • The first term is RSS, and tries to make \(g(x)\) match the data at each \(x_i\).
    • The second term is a roughness penalty and controls how wiggly \(g(x)\) is. It is modulated by the tuning parameter \(\lambda \ge 0\).
      • The smaller \(\lambda\), the more wiggly the function, eventually interpolating \(y_i\) when \(\lambda = 0\).
      • As \(\lambda \to \infty\), the function \(g(x)\) becomes linear.
  • It can be shown that this problem has an explicit, finite-dimensional, unique minimizer which is a natural cubic spline with knots at the unique values of the \(x_i\).

  • Smoothing splines avoid the knot-selection issue entirely, leaving only the single roughness parameter \(\lambda\) to be chosen.

  • The vector of \(n\) fitted values can be written as \(\hat{g}_\lambda = S_\lambda y\), where \(S_{\lambda}\) is an \(n \times n\) matrix (determined by the \(x_i\) and \(\lambda\)).

  • The effective degrees of freedom are given by \[ \text{df}_{\lambda} = \sum_{i=1}^n S_{\lambda,ii}. \] Thus we can specify df rather than \(\lambda\).

  • The leave-one-out (LOO) cross-validated error is given by \[ \text{RSS}_{\text{CV}}(\lambda) = \sum_{i=1}^n \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - S_{\lambda,ii}} \right]^2. \]
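
The smooth.spline() function implements exactly this: with cv = TRUE it selects \(\lambda\) by the LOOCV criterion above. A minimal sketch:

# LOOCV-tuned smoothing spline on the Wage data
fit_ss <- with(Wage, smooth.spline(age, wage, cv = TRUE))
fit_ss$df        # effective degrees of freedom, sum of S_lambda[i, i]
fit_ss$lambda    # selected smoothing parameter
fit_ss$cv.crit   # LOOCV criterion at the selected lambda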

The ggformula package supplies the geom_spline function for displaying smoothing spline fits.

Code
library(ggformula)
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Smoothing spline with df = 16
  geom_spline(
      df = 16,
      color = "red"
    ) +
  # Smoothing spline with GCV tuned df
  geom_spline(
    # df = 6.8,
    cv = TRUE,
    color = "blue"
    ) +
  labs(
    title = "Smoothing spline with df=16 (red) vs LOOCV tuned df=6.8 (blue)",
    x = "Age",
    y = "Wage (k$)"
    )
Warning in smooth.spline(data$x, data$y, w = weight, spar = spar, cv = cv, :
cross-validation with non-unique 'x' values seems doubtful

  • Application to logistic regression \[ \log\frac{\mathbf{P}(Y=1|X)}{\mathbf{P}(Y=0|X)} = f(X), \] therefore \[ \mathbf{P}(Y=1|X) = \frac{\exp(f(X))}{1 + \exp(f(X))} = p(x). \]

  • The penalized log-likelihood criterion is \[ \ell(f,\lambda) = \sum_{i=1}^n \left[ y_i \log p(x_i) + (1-y_i) \log (1-p(x_i)) \right] - \lambda \int f''(t)^2 \, dt. \]

  • The solution is a natural cubic spline with knots at the unique values of the \(x_i\).
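
As a sketch of this in practice, the gam package fits a smoothing-spline logistic regression by penalized likelihood; here for the high-earner indicator wage > 250, with an illustrative df choice.

library(gam)
# Smoothing-spline logistic regression: logit P(wage > 250 | age) = f(age)
fit_logit <- gam(I(wage > 250) ~ s(age, df = 4), family = binomial, data = Wage)
summary(fit_logit)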

5 Local regression

  • With a sliding weight function, we fit separate linear fits over the range of \(X\) by weighted least squares.

  • At \(X=x_0\), \[ \text{minimize} \quad \sum_{i=1}^n K(x_i, x_0) (y_i - \beta_0 - \beta_1 x_i)^2, \] where \(K\) is a weighting function that assigns heavier weight to \(x_i\) close to \(x_0\) and zero weight to points furthest from \(x_0\) (see the sketch after this list).

  • Locally weighted linear regression: loess function in R and lowess in Python.

  • Anecdotally, loess gives better appearance, but is \(O(N^2)\) in memory, so does not work for larger data sets.

  • Fitting a local regression involves several choices: the weighting function \(K\), the degree of the local polynomial (constant, linear, or quadratic), and so on. While all of these choices make some difference, the most important one is the span \(s\), the proportion of points used to compute the local regression at each \(x_0\).

  • The span plays a role like that of the tuning parameter \(\lambda\) in smoothing splines: it controls the flexibility of the non-linear fit.

    • The smaller the value of \(s\), the more local and wiggly our fit will be; alternatively, a very large value of \(s\) will lead to a global fit using all of the training observations.
    • Use cross-validation to choose \(s\), or just specify it directly.
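
A minimal sketch of a single local fit at \(x_0 = 50\), using the tricube kernel (loess's default weight); the bandwidth h is an illustrative stand-in for the span.

# One local linear fit at x0 = 50 with tricube weights K(x_i, x0)
x0 <- 50
h  <- 10   # bandwidth (illustrative; loess parameterizes by span instead)
w  <- with(Wage, pmax(1 - (abs(age - x0) / h)^3, 0)^3)
fit_loc <- lm(wage ~ age, data = Wage, weights = w)
predict(fit_loc, newdata = data.frame(age = x0))   # fitted value at x0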

6 Generalized additive model (GAM)

  • Generalized additive models (GAMs) allow for flexible nonlinearities in several variables, while retaining the additive structure of linear models. \[ y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p (x_{ip}) + \epsilon_i. \]

  • We can fit a GAM simply using, e.g., natural splines.

  • Coefficients not that interesting; fitted functions are.

  • Can mix terms: some linear, some nonlinear, and use ANOVA to compare models.

  • Can use smoothing splines or local regression as well. In R: gam(wage ~ s(year, df = 5) + lo(age, span = 0.5) + education).

  • GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form (in R) ns(age, df = 5):ns(year, df = 5), as sketched below.
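
A minimal sketch of such an interaction (df choices illustrative, matching the additive fit below), compared to the additive model by ANOVA:

library(splines)
# Additive GAM vs. GAM with a spline-by-spline interaction of age and year
gam_add <- lm(wage ~ ns(year, df = 4) + ns(age, df = 5) + education, data = Wage)
gam_int <- update(gam_add, . ~ . + ns(age, df = 5):ns(year, df = 4))
anova(gam_add, gam_int)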

Natural splines for year and age.

gam_mod <- lm(
  wage ~ ns(year, df = 4) + ns(age, df = 5) + education,
  data = Wage
  )
summary(gam_mod)

Call:
lm(formula = wage ~ ns(year, df = 4) + ns(age, df = 5) + education, 
    data = Wage)

Residuals:
     Min       1Q   Median       3Q      Max 
-120.513  -19.608   -3.583   14.112  214.535 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   46.949      4.704   9.980  < 2e-16 ***
ns(year, df = 4)1              8.625      3.466   2.488  0.01289 *  
ns(year, df = 4)2              3.762      2.959   1.271  0.20369    
ns(year, df = 4)3              8.127      4.211   1.930  0.05375 .  
ns(year, df = 4)4              6.806      2.397   2.840  0.00455 ** 
ns(age, df = 5)1              45.170      4.193  10.771  < 2e-16 ***
ns(age, df = 5)2              38.450      5.076   7.575 4.78e-14 ***
ns(age, df = 5)3              34.239      4.383   7.813 7.69e-15 ***
ns(age, df = 5)4              48.678     10.572   4.605 4.31e-06 ***
ns(age, df = 5)5               6.557      8.367   0.784  0.43328    
education2. HS Grad           10.983      2.430   4.520 6.43e-06 ***
education3. Some College      23.473      2.562   9.163  < 2e-16 ***
education4. College Grad      38.314      2.547  15.042  < 2e-16 ***
education5. Advanced Degree   62.554      2.761  22.654  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.16 on 2986 degrees of freedom
Multiple R-squared:  0.293, Adjusted R-squared:  0.2899 
F-statistic:  95.2 on 13 and 2986 DF,  p-value: < 2.2e-16

Smoothing splines for year and age.

library(gam)

gam_mod <- gam(
  wage ~ s(year, 4) + s(age, 5) + education,
  data = Wage
  )
summary(gam_mod)

Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-119.43  -19.70   -3.33   14.17  213.48 

(Dispersion Parameter for gaussian family taken to be 1235.69)

    Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29887.75 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
             Df  Sum Sq Mean Sq F value    Pr(>F)    
s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
education     4 1069726  267432 216.423 < 2.2e-16 ***
Residuals  2986 3689770    1236                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
            Npar Df Npar F  Pr(F)    
(Intercept)                          
s(year, 4)        3  1.086 0.3537    
s(age, 5)         4 32.380 <2e-16 ***
education                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(gam_mod, se = TRUE, col = "red")

from sklearn.preprocessing import OneHotEncoder, SplineTransformer

# Natural cubic spline features of year predictor
ns_tf = make_column_transformer(
  (SplineTransformer(
    n_knots = 4,
    # knots = 'quantile',
    degree = 3, 
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
    ), ['year']),
  (SplineTransformer(
    n_knots = 5,
    # knots = 'quantile',
    degree = 3, 
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
    ), ['age']),
  (OneHotEncoder(drop = 'first'), ['education']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("ns_tf", ns_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)
Pipeline(steps=[('ns_tf',
                 ColumnTransformer(transformers=[('splinetransformer-1',
                                                  SplineTransformer(extrapolation='linear',
                                                                    n_knots=4),
                                                  ['year']),
                                                 ('splinetransformer-2',
                                                  SplineTransformer(extrapolation='linear'),
                                                  ['age']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['education'])])),
                ('model', LinearRegression())])
# R^2
pipe.score(X, y)
0.2924478828190137
from statsmodels.gam.api import GLMGam, BSplines

# Create spline basis for year and age
x_spline = Wage[['year', 'age']]
bs = BSplines(x_spline, df = [4, 5], degree = [3, 3])

# Fit GAM
gam_mod = GLMGam.from_formula('wage ~ education', data = Wage, smoother = bs).fit()
gam_mod.summary()
              Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                wage     No. Observations:                 3000
Model:                      GLMGam     Df Residuals:                     2988
Model Family:             Gaussian     Df Model:                        11.00
Link Function:            Identity     Scale:                          1238.8
Method:                      PIRLS     Log-Likelihood:                -14934.
Date:             Tue, 25 Feb 2025     Deviance:                   3.7014e+06
Time:                     17:47:51     Pearson chi2:                 3.70e+06
No. Iterations:                  3     Pseudo R-squ. (CS):             0.3358
Covariance Type:         nonrobust
==============================================================================
                                    coef    std err        z    P>|z|   [0.025   0.975]
Intercept                        41.7250      5.268    7.921    0.000   31.401   52.049
education[T.2. HS Grad]          10.7413      2.431    4.418    0.000    5.977   15.506
education[T.3. Some College]     23.2067      2.563    9.056    0.000   18.184   28.229
education[T.4. College Grad]     37.8704      2.547   14.871    0.000   32.879   42.862
education[T.5. Advanced Degree]  62.4355      2.764   22.591    0.000   57.019   67.852
year_s0                           6.7031      4.972    1.348    0.178   -3.041   16.448
year_s1                           5.2035      4.237    1.228    0.219   -3.101   13.508
year_s2                           7.5149      2.343    3.208    0.001    2.924   12.106
age_s0                           26.8982      7.771    3.461    0.001   11.667   42.129
age_s1                           64.4339      5.693   11.318    0.000   53.275   75.592
age_s2                           33.2186      9.402    3.533    0.000   14.791   51.646
age_s3                           27.9161     11.042    2.528    0.011    6.275   49.557
==============================================================================
# Plot smooth components
for i in [0, 1]:
  plt.figure()
  gam_mod.plot_partial(i, cpr = True)
  plt.show()

7 Lab

7.1 Polynomial Regression and Step Functions

  • To predict wage using a fourth-degree polynomial in age: poly(age, 4) in lm
  • The poly() function creates orthogonal polynomials: the columns are uncorrelated, have mean zero, and each is a linear combination of the variables age, age^2, age^3 and age^4.
  • Alternatively, poly() can return age, age^2, age^3 and age^4 directly via the raw = TRUE argument.
attach(Wage)
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
                Estimate Std. Error    t value     Pr(>|t|)
(Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02
fit2 <- lm(wage ~ poly(age, 4, raw = T), data = Wage)
coef(summary(fit2))
                            Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)            -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
poly(age, 4, raw = T)1  2.124552e+01 5.886748e+00  3.609042 0.0003123618
poly(age, 4, raw = T)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
poly(age, 4, raw = T)3  6.810688e-03 3.065931e-03  2.221409 0.0263977518
poly(age, 4, raw = T)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
# Equivalent to
fit3 <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
  • In performing a polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests.
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage) 
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage) 
fit.5 <- lm(wage ~ poly(age, 5), data = Wage) 
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
Analysis of Variance Table

Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1   2998 5022216                                    
2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
3   2996 4777674  1     15756   9.8888  0.001679 ** 
4   2995 4771604  1      6070   3.8098  0.051046 .  
5   2994 4770322  1      1283   0.8050  0.369682    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation, as discussed in Chapter 5.
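
A minimal sketch of that cross-validation approach, assuming the boot package (cf. Chapter 5):

library(boot)
# 10-fold CV estimate of test error for polynomial degrees 1 to 5
set.seed(1)
cv_error <- sapply(1:5, function(d) {
  fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.glm(Wage, fit, K = 10)$delta[1]
})
cv_error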

  • We can also fit a polynomial logistic regression, for example to predict whether an individual earns more than $250,000 per year:

fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
  • We make predictions using the predict() function; with type = "response" we get predicted probabilities rather than logits.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit, newdata = list(age = age.grid), se = T)
preds <- predict(fit, newdata = list(age = age.grid), type = "response", se = T)
  • In order to fit a step function, as discussed in Section 7.2, we use the cut() function.
fit <- lm(wage ~ cut(age, 4), data = Wage)
summary(fit)

Call:
lm(formula = wage ~ cut(age, 4), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.126 -24.803  -6.177  16.493 200.519 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)              94.158      1.476  63.790   <2e-16 ***
cut(age, 4)(33.5,49]     24.053      1.829  13.148   <2e-16 ***
cut(age, 4)(49,64.5]     23.665      2.068  11.443   <2e-16 ***
cut(age, 4)(64.5,80.1]    7.641      4.987   1.532    0.126    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.42 on 2996 degrees of freedom
Multiple R-squared:  0.0625,    Adjusted R-squared:  0.06156 
F-statistic: 66.58 on 3 and 2996 DF,  p-value: < 2.2e-16

7.2 Splines

  • In order to fit regression splines in R, we use the splines library.
  • The bs() function generates the entire matrix of basis functions for splines with the specified set of knots.
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
  • The df option to bs() can be used to produce a spline with specified degrees of freedom.
dim(bs(age, knots = c(25, 40, 60))) 
[1] 3000    6
dim(bs(age, df = 6))
[1] 3000    6
attr(bs(age, df = 6), "knots") 
[1] 33.75 42.00 51.00
  • To fit a natural spline instead, we use the ns() function. Here we fit a natural spline with four degrees of freedom.
fit <- lm(wage ~ ns(age, df = 4), data = Wage)
summary(fit)

Call:
lm(formula = wage ~ ns(age, df = 4), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.737 -24.477  -5.083  15.371 204.874 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        58.556      4.235  13.827   <2e-16 ***
ns(age, df = 4)1   60.462      4.190  14.430   <2e-16 ***
ns(age, df = 4)2   41.963      4.372   9.597   <2e-16 ***
ns(age, df = 4)3   97.020     10.386   9.341   <2e-16 ***
ns(age, df = 4)4    9.773      8.657   1.129    0.259    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.92 on 2995 degrees of freedom
Multiple R-squared:  0.08598,   Adjusted R-squared:  0.08476 
F-statistic: 70.43 on 4 and 2995 DF,  p-value: < 2.2e-16
  • In order to fit a smoothing spline, we use the smooth.spline() function.
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
Warning in smooth.spline(age, wage, cv = TRUE): cross-validation with
non-unique 'x' values seems doubtful
  • Local regression can be fit with the loess() function:
attach(Wage)
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Local Regression")
fit <- loess(wage ~ age, span = .2, data = Wage)
fit2 <- loess(wage ~ age, span = .5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"), col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)

7.3 GAMs

  • In order to fit a GAM in R, we use the gam function, which is part of the gam library.
  • The s() function, which is part of the gam library, is used to indicate that we would like to use a smoothing spline.
library(gam) 
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage) 
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
Analysis of Deviance Table

Model 1: wage ~ s(age, 5) + education
Model 2: wage ~ year + s(age, 5) + education
Model 3: wage ~ s(year, 4) + s(age, 5) + education
  Resid. Df Resid. Dev Df Deviance       F    Pr(>F)    
1      2990    3711731                                  
2      2989    3693842  1  17889.2 14.4771 0.0001447 ***
3      2986    3689770  3   4071.1  1.0982 0.3485661    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We can also use local regression fits as building blocks in a GAM, using the lo() function.
gam.lo <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, :
liv too small.  (Discovered by lowesd)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv
too small.  (Discovered by lowesd)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, :
liv too small.  (Discovered by lowesd)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv
too small.  (Discovered by lowesd)
plot(gam.lo, se = TRUE, col = "green")